In my Finding Things chapter of Beautiful Code, the first complete program is a little Ruby script that reads the ongoing Apache logfile and figures out which articles have been fetched the most. It’s a classic example of the culture, born in Awk, perfected in Perl, of getting useful work done by combining regular expressions and hash tables. I want to figure out how to write an equivalent program that runs fast on modern CPUs with low clock rates but many cores; this is the Wide Finder project.
It’s already obvious that Web-oriented frameworks in general, and Java EE in particular, run like a bat out of hell on processors like our Niagara boxes and especially the systems that will be built around the new T2.
But you know, there’s lots of interesting computing that doesn’t happen in Web frameworks. Like, for example, logfile processing. And since more and more computers are going to start looking more and more like our Niagara-family boxes, we’re going to need to figure out how to get this kind of basic unglamorous essential job to run fast on them.
Whatever it is we use to write the many-core-friendly Intergalactic P2P Adaptive Semantic Web4.0 killer apps of the future, well, it has to be a general-purpose programming language, because they always win. Which means it has to have a good way to read a logfile and compute the most popular fetches. You might argue that this program is kind of heavy on I/O and text processing. That’s correct; but so is computing in general.
For reference, here’s the little Ruby program:
counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end
Several have pointed out that I should have used sort_by at the bottom. Whatever.
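For what it's worth, the sort_by variant would just replace the sort line with something like this; I haven't timed it, but it should order things identically:

keys_by_count = counts.keys.sort_by { |key| -counts[key] }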
It took this code 7½ seconds of CPU, 13½ seconds elapsed, to process a million and change records, a quarter-gig or so, on last year’s 1.67GHz PowerBook. A Perl version was approximately twice as fast.
What I’m looking for is something that’s approximately as compact and readable, and runs many times faster on a modern many-core meat-grinder. Should be fun.
From: Joe Cheng (Sep 21 2007, at 17:21)
The Perl version was "twice as fast" meaning half as much CPU time, or half as much total time? Because if the latter, doesn't that mean you're already totally dominated by IO?
From: Tim Bray (Sep 21 2007, at 17:35)
Joe: About half as much, both CPU and elapsed. On that particular machine, perl's I/O and regexp/hash implementations were both faster.
From: Greg Janée (Sep 21 2007, at 20:56)
Not sure about the mechanics, but on a conceptual level at least, seems a perfect fit for the map/reduce paradigm, http://labs.google.com/papers/mapreduce.html .
From: Tom White (Sep 22 2007, at 09:24)
This application is indeed a perfect fit for the MapReduce paradigm - I wrote an article showing how to write a MapReduce program to analyze web logs using Hadoop here: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873&categoryID=112.
The article differs from what you are doing in that it runs Hadoop on Amazon EC2 rather than on a single box with multiple cores. Hadoop has really been optimized for multiple machines rather than multiple cores on one machine - but it can run in that configuration. Also, the program in the article doesn't compute the most popular URLs, although this is easy to write - see the examples that come with Hadoop for inspiration.
Is it as compact and readable? It's not as compact, but that's mainly due to Java not being as expressive as Ruby. If you wrote it in Pig (http://research.yahoo.com/project/pig) it would probably be shorter than the Ruby version. Readability is harder to measure, but it's probably safe to say the MapReduce/Pig version is less familiar and therefore less easy to grasp than the Ruby version. So what would be cool is something that understands that the Ruby code is "MapReduceable", translates it into a MapReduce program and runs it transparently, exploiting all the cores available. I wonder, could a virtual machine do this one day?
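For illustration only, the single-box shape of those map and reduce steps might look roughly like this in plain Ruby (nothing Hadoop-specific; the worker count and forking one worker per byte-range of the file are just assumptions for the sketch, and it needs a Unix-ish Ruby with fork):

file    = ARGV[0]
workers = 4                                   # assumption for the sketch
size    = File.size(file)
chunk   = size / workers
re      = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

pipes = (0...workers).map do |i|
  rd, wr = IO.pipe
  fork do                                     # the "map" side: one hash per worker
    rd.close
    counts = Hash.new(0)
    lo = i * chunk
    hi = (i == workers - 1) ? size : (i + 1) * chunk
    File.open(file) do |f|
      if lo > 0
        f.seek(lo - 1)
        f.gets                                # discard the line that straddles the seam
      end
      while line = f.gets
        counts[$1] += 1 if line =~ re
        break if f.pos >= hi
      end
    end
    wr.write(Marshal.dump(counts))
    wr.close
    exit!
  end
  wr.close
  rd
end

total = Hash.new(0)                           # the "reduce" side: merge the partial hashes
pipes.each do |rd|
  Marshal.load(rd.read).each { |k, v| total[k] += v }
  rd.close
end
Process.waitall

total.sort_by { |_k, n| -n }[0, 10].each { |k, n| puts "#{n}: #{k}" }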
From: Jon (Sep 24 2007, at 10:47)
Shouldn't you define the regex outside the each loop?
From: Seth W. Klein (Sep 24 2007, at 14:19)
Jon: my test suggests not:
--- ~ S
for i in 1 2 3 4; do time ruby -e 'r=%r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }; 1_000_000.times { r }'; done
real 0m1.539s
user 0m1.496s
sys 0m0.040s
real 0m1.544s
user 0m1.484s
sys 0m0.060s
real 0m1.544s
user 0m1.496s
sys 0m0.048s
real 0m1.545s
user 0m1.484s
sys 0m0.064s
--- ~ S
for i in 1 2 3 4; do time ruby -e '1_000_000.times { %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) } }'; done
real 0m1.527s
user 0m1.488s
sys 0m0.036s
real 0m1.725s
user 0m1.656s
sys 0m0.068s
real 0m1.530s
user 0m1.492s
sys 0m0.032s
real 0m1.550s
user 0m1.496s
sys 0m0.052s
--- ~ S
From: CB (Sep 25 2007, at 04:53)
In Java processing 1 million lines in one thread takes about 4 seconds. Just the IO to read in those lines takes about 3 seconds. This suggests that this problem is not worth parallelizing.
I tried the simplest multithreaded approach in Java. The main thread sticks lines on a queue and another thread pulls them off and does the regexp match etc. The performance of this approach (on a 4-CPU box) is almost twice as slow as the single-threaded solution. By collecting batches of lines and putting them on the queue, the performance gets to around the same level as the single thread. This is not because Java's concurrency support is bad. The util.concurrent blocking queues are very efficient (probably as efficient as the message queues for Erlang processes). The amount of work done for each line is just so little that the overhead of locking and queue operations is not worth it.
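For comparison, the same batching shape looks roughly like this in Ruby (just a sketch; the batch size and queue bound are arbitrary, and on MRI the global interpreter lock means it mostly illustrates the structure rather than buying real parallelism):

require 'thread'

RE     = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
BATCH  = 500
queue  = SizedQueue.new(4)            # bounded, so the reader can't run away
counts = Hash.new(0)

worker = Thread.new do
  while (batch = queue.pop)           # a nil batch means end of input
    batch.each { |line| counts[$1] += 1 if line =~ RE }
  end
end

batch = []
ARGF.each_line do |line|
  batch << line
  if batch.size == BATCH
    queue.push(batch)
    batch = []
  end
end
queue.push(batch) unless batch.empty?
queue.push(nil)
worker.join

counts.sort_by { |_k, n| -n }[0, 10].each { |k, n| puts "#{n}: #{k}" }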
The only sensible concurrency you could get for this problem is by storing different days' logs on different disks and spawning a thread/task/process for each log file, finally combining the maps at the end.
From: Anthony Starks (Sep 25 2007, at 11:21)
Speaking of the comparison of regular expression implementations, see:
http://swtch.com/~rsc/regexp/regexp1.html
bottom line: the implementations in common scripting languages are inferior to the classic implementations in programs like awk.
From: Pete Kirkham (Sep 25 2007, at 14:15)
> In Java processing 1 million lines in one thread takes about 4 seconds. Just the IO to read in those lines takes about 3 seconds. This suggests that this problem is not worth parallelizing.
Arguably, the number of hours spent reading this thread exceeds the CPU time spent running code by several orders of magnitude, so in terms of net efficiency it suggests using the first scraggy Perl script that did the job.
On my machine (after resetting the disk cache by reading some other large file), scanning for the strings in C (if you write Java using char or byte arrays rather than j.l.String you should get similar performance) gives:
real 0m7.602s
user 0m1.620s
sys 0m0.232s
and running just the IO:
real 0m8.509s
user 0m0.000s
sys 0m0.352s
So on that basis, there is a big user time which would manifest itself if the disk were faster. I've only one core, so the whole poly-core thing is also hypothetical.
Laptops with solid state drives are already on the market; running the same with the files already in cache gives an idea of the times:
string matching, file cached in memory:
real 0m1.255s
user 0m1.124s
sys 0m0.108s
just IO, file cached in memory:
real 0m0.247s
user 0m0.004s
sys 0m0.216s
(obviously single samples are all over the place and only order-of-magnitude data points)
So if we're talking about most machines being poly-core, then you probably need to shift to a different architecture - a smaller solid state disk for each core, a file system that doesn't expose disks to the user - a federation of systems rather than just multiplying the CPUs in a current machine architecture. Maybe http://millicomputing.blogspot.com/ will come up with a federated model. ZFS is some way there for the file system. Or something closer to an FPGA which lets you shift between cores and stores depending on your problem space. Single-core machines are already fast enough to often be limited by disk or memory bandwidth in data processing algorithms.
On the other hand, if you are using a language whose implementation is taking as much CPU time to process the file as the disk needs real time to read it, then there's an issue with that VM.
I'd hope VMs get better, and maybe a few multilanguage VMs dominate - none of the current ones I know of are efficient poly-threaded implementations though. I'd expect magnetic disks to get either bigger caches or replaced by solid state drives (much as memory sticks have replaced floppies then grew beyond that size) before poly-core machines become the most efficient point for speed improvements in this problem space.
From: CB (Sep 26 2007, at 09:52)
I am super stupid. I just realised I had confused the meanings of user and real time. My Java concurrent version was faster when using 2 threads and batches of at least 100 lines.
I also discovered that the Java IO is around 30-40% faster if you use an InputStreamReader rather than a FileReader, and specify that your character set is US-ASCII rather than accepting the default.
Some numbers for 1 million lines:
IO only with one thread:
real 0m2.267s
user 0m2.329s
sys 0m0.210s
Single thread:
real 0m3.208s
user 0m3.460s
sys 0m0.215s
2 Threads passing batches of 500 lines on a queue:
real 0m2.597s
user 0m4.137s
sys 0m0.294s
2 Threads using 2 pre-constructed lists for the batches of 500 and passing them through an Exchanger:
real 0m2.649s
user 0m4.171s
sys 0m0.277s
From: Bob Hutchison (Sep 26 2007, at 19:30)
Did you intend to have a trailing space in your regex?
From: Tim (Sep 26 2007, at 20:05)
Bob: Yes, the space is intentional; if you read the chapter it explains why.
From: Paul (Sep 26 2007, at 23:42)
Maybe speeding it up using memoization for multi-cores??? (just a thought)
http://eigenclass.org/hiki/bounded-space-memoization
http://eigenclass.org/hiki/Hash+default+block
From: Caoyuan Deng (Sep 27 2007, at 02:20)
I wrote a solution in Erlang, with parallel file data reading plus a simple expression match; for a 1 million line file (200M size) it took 8.311s with -smp enabled, and 10.206s with smp disabled. The code is at:
http://blogtrader.net/page/dcaoyuan?entry=tim_bray_s_erlang_exercise
The solution spawns a lot of processes to read the file into binaries in parallel, then sends them to a collect_loop; collect_loop will buffered_read each chunk (when the chunk order is correct), buffer_read will convert the binary to small (4096 bytes here) lists, then scan_line will merge them into lines, and process_match runs on each line.
From: Bruno (Sep 27 2007, at 15:28)
Entirely beside the point, but how is this beautiful code? It's a common regexp-driven throw-away script! In fact, you don't even need to start an editor to obtain this kind of information if you know the uniq command. E.g. with your data file:
$ egrep -o "GET /ongoing/When/[^ .]+ " o10k.ap | sort | uniq -c | sort -rn | head -n 10
89 GET /ongoing/When/200x/2006/09/29/Dynamic-IDE
20 GET /ongoing/When/200x/2006/07/28/Open-Data
13 GET /ongoing/When/200x/2003/07/25/NotGaming
8 GET /ongoing/When/200x/2006/01/31/Data-Protection
8 GET /ongoing/When/200x/2003/10/16/Debbie
8 GET /ongoing/When/200x/2003/09/18/NXML
7 GET /ongoing/When/200x/2003/06/23/SamsPie
6 GET /ongoing/When/200x/2006/09/11/Making-Markup
6 GET /ongoing/When/200x/2006/09/07/JRuby-guys
6 GET /ongoing/When/200x/2006/01/08/No-New-XML-Languages
Also, I'm a little offended by the phrase "born in Awk, perfected in Perl". Awk is a great example of a little Unix language, one that's specifically tailored for the task of processing records. If it featured a more contemporary regexp library, the Awk version of this script would approximate perfection. Perl (and Ruby for that matter) are a step back in the sense that they are general-purpose languages that try to fit everything in.
From: RJSJr (Sep 27 2007, at 21:46)
For the map-reduce algorithm in Ruby have you looked at Starfish?
Here is the home page for the project. It looks unmaintained for the past year but may still be worth the time and effort.
A little overview page is at http://tech.rufy.com/2006/08/mapreduce-for-ruby-ridiculously-easy.html
Google "ruby starfish" for additional info.
From: Arthuro Toscano (Sep 30 2007, at 01:58)
Here is IBM's “Java-killer”, Mr. Bray
http://x10.sourceforge.net/x10home.shtml
From: Mark Hoemmen (Sep 30 2007, at 11:10)
I've written a bunch of parallel performance-oriented code for my work as well as in classes (both taking and teaching). I've found that it helps to model performance first before parallelizing. In this case, you're probably running into disk bandwidth limitations: you're on only one disk (and a laptop disk too), and not a RAID array set up to stripe across disks. Your disk can't feed your cores with data fast enough to keep them busy.
Here's a rough performance estimation. Let's say you can read data from disk into RAM at about 60 MB/s. You can stream data from RAM into your cores at around 10 GB/s. If you have 8 cores, say, then each core can read data at a rate of around 1.2 GB/s. (We assume they don't need to compute much or write much data, say for a word count.) This means that as long as there's one disk feeding all the cores, you'll never be able to feed them with data fast enough to keep up with how fast they can process it. You'll be disk-bound unless the computation per line is high enough (say for a hairy regular expression) to slow down the cores commensurately.
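Back-of-the-envelope, with those assumed numbers:

disk  = 60.0e6        # bytes/second off the disk (assumed)
ram   = 10.0e9        # bytes/second from RAM to the cores (assumed)
cores = 8

puts ram / cores      # ~1.25e9: what each core could stream
puts disk / ram       # 0.006: the disk covers well under 1% of that demand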
As mentioned in the comment stream, the right way to solve this problem is to increase bandwidth by setting up a RAID array with several disks. (Incidentally, this was much how the earlier vector supercomputers, like the Cray 1, were able to get such high memory bandwidth: they used "banked" memory, which striped the data across several memory units.) Parallelizing the computation will only help if the computation itself is very demanding.
From: that osd guy (Oct 02 2007, at 18:50)
Since the limitation is on disk bandwidth, has anyone considered the "RamDisk" approach?
From: Dave Thomas (Oct 04 2007, at 12:31)
Tim:
FWIW, you can speed this up by reading the initial file into memory, then processing it. Doesn't address your parallelism question, but...
Here's a version that runs on a 2,560,000 line file based on your test data. It takes 3.6s, compared to 4.6 for the original.
Dave
counts = Hash.new(0)
RE = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
data = ARGF.read
data.scan(RE) { |ongoing| counts[ongoing] += 1 }
keys_by_count = counts.keys.sort_by { |a| -counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end
From: Ralph Corderoy (Oct 09 2007, at 03:38)
The example Ruby code doesn't properly parse the log file; e.g., it matches a GET in the Referer field too. If that doesn't matter for your needs then this is fine, and it allows others to use Python's findall() on a whole chunk of the file without regard for lines. But I think it would be interesting to compare *correct* parsers.
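One cheap way to tighten the Ruby version would be to match only against the request field, something like this (a sketch, untested against the real data):

counts = Hash.new(0)
re = %r{\AGET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

ARGF.each_line do |line|
  request = line.split('"')[1] or next   # the first quoted field is the request line
  counts[$1] += 1 if request =~ re
end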
From: Will (Oct 09 2007, at 05:35)
In what way is logfile processing interesting? Sincere question.
From: nikuro (Oct 16 2007, at 03:27)
There may be a slight speed-up by avoiding back referencing (by using ?: or ?>):
%r{GET /ongoing/When/\d\d\dx/(?:\d\d\d\d/\d\d/\d\d/[^ .]+) }
%r{GET /ongoing/When/\d\d\dx/(?>\d\d\d\d/\d\d/\d\d/[^ .]+) }
From: nikuro (Oct 16 2007, at 05:50)
Oops! Should read:
%r{GET /ongoing/When/\d\d\dx/((?:\d\d\d\d/\d\d/\d\d/[^ .]+)) }
%r{GET /ongoing/When/\d\d\dx/((?>\d\d\d\d/\d\d/\d\d/[^ .]+)) }
From: Todd (Oct 16 2007, at 14:16)
Your regex doesn't normalize for query strings, so you could potentially have a near-infinite number of unique URLs, and stuffing them all into a hash isn't very memory-efficient. If you can live with an approximate top-k listing, consider lossy counting.
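For reference, lossy counting is pleasantly small; here's a rough Ruby sketch of Manku and Motwani's algorithm (the error bound is arbitrary):

EPSILON = 0.001                 # arbitrary error bound for the sketch
WIDTH   = (1.0 / EPSILON).ceil  # bucket width

counts = {}                     # key => [count, max_error]
seen   = 0
bucket = 1

ARGF.each_line do |line|
  next unless line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
  key = $1
  seen += 1
  if counts[key]
    counts[key][0] += 1
  else
    counts[key] = [1, bucket - 1]
  end
  if seen % WIDTH == 0          # bucket boundary: prune the rare keys
    counts.delete_if { |_k, (c, e)| c + e <= bucket }
    bucket += 1
  end
end

counts.sort_by { |_k, (c, _e)| -c }[0, 10].each do |key, (c, _e)|
  puts "~#{c}: #{key}"
end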
From: Jordan Callicoat (Oct 16 2007, at 23:25)
A Simple Haskell version using a (very) simple, unchecked recursive descent, rather than regex:
module Main where

import Data.Maybe
import System (getArgs)
import Control.Monad (forM_)
import qualified Data.Map as M
import Data.List (foldl', sortBy)
import qualified Data.ByteString.Lazy.Char8 as LB
import KMP (kmpMatchLL) -- Justin Bailey, et al's KMP search for Lazy ByteStrings

countLines :: LB.ByteString -> M.Map LB.ByteString Int
countLines = foldl' count M.empty . LB.lines where
  count m line =
    let matches = kmpMatchLL (LB.pack "/ongoing/When/") line
        uri = if (not . null) matches
          then
            let uri1 = LB.drop ((matches !! 0) + 19) line
                uri2 = LB.take (fromJust (LB.elemIndex ' ' uri1)) uri1
            in case LB.elemIndex '.' uri2 of
                 Just idx -> LB.take idx uri2
                 Nothing  -> uri2
          else LB.empty
    in if (not . LB.null) uri
         then M.insertWith' (+) uri 1 m
         else m

main :: IO ()
main = do
    args <- getArgs
    forM_ args $ \name -> (fmap countLines . LB.readFile) name >>= \m ->
      let mp = (take 10 . sortBy (flip compare `on` snd) . M.toList) m in
      mapM_ (putStrLn . (\x -> (show $ snd x) ++ ": " ++ (LB.unpack $ fst x))) mp
  where on f g x y = g x `f` g y
From: Sean O'Rourke (Nov 08 2007, at 20:25)
I was underwhelmed by the Perl submission. Here's a straightforward version that is significantly faster on my test laptop:
http://www.cs.ucsd.edu/~sorourke/wf.pl
From: Aristotle Pagaltzis (Nov 09 2007, at 07:59)
Sean O’Rourke’s submission seems to be buggy; the output doesn’t line up with a doublecheck I obtained using the following command:
LC_CTYPE=C egrep -o 'GET /ongoing/When/[0-9]{3}x/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+' o10k.ap | sort | uniq -c | sort -rn | head -10
From: Sean O'Rourke (Nov 10 2007, at 08:16)
@Aristotle: (1) You forgot a space at the end of your regex. (2) The order of results with the same count is not specified. (3) I stupidly printed out an extra line, which I fixed.
From: Paul W. Homer (Nov 15 2007, at 14:32)
Maybe I've missed something (that happens :-), but you're asking for an up-to-date, modern way to quickly parse an old-fashioned logfile?
The crude containerization of log data into a streaming log format is a legacy of a text-based paradigm of processing data, isn't it? Dumping it in a database would be better. Dumping it into something like an indexed version of a DB would be faster still.
The best optimizations come from doing work as early as possible and then saving it to be re-used later. If your log file entries start life as 'parsed', why 'unparse' them into a file?
Paul.
http://theprogrammersparadox.blogspot.com/
From: Wes Felter (Nov 17 2007, at 15:10)
Some philosophical thoughts on the big picture:
http://wmf.editthispage.com/2007/11/17